[SPARK-56917][TESTS][CONNECT] Expand Connect-specific tests for DataFrame column resolution by zhengruifeng · Pull Request #55947 · apache/spark

zhengruifeng · 2026-05-18T04:04:57Z

What changes were proposed in this pull request?

This PR widens test coverage of how a tagged DataFrame column reference (df.col / df["col"], which carries the source DataFrame's plan id) resolves after a range of operators, pinning the behavior across Spark Classic and both modes of spark.sql.analyzer.strictDataFrameColumnResolution on Spark Connect.

The new tests are added to the shared ColumnTestsMixin in python/pyspark/sql/tests/test_column.py, so each one runs in three environments:

ColumnTests - Classic (default strictDataFrameColumnResolution=true),
ColumnParityTests - Connect strict,
ColumnParityTestsWithNonStrictDFColResolution - Connect lenient (extends the strict suite).

The base mixin asserts the behavior shared by all three. The few cases where Connect diverges are overridden in the Connect parity suites (python/pyspark/sql/tests/connect/test_parity_column.py) rather than duplicated, so the shared cases stay mode-agnostic. This follows the base-suite + per-mode-override structure suggested in review.

Per-operator resolution tests (added to ColumnTestsMixin):

Pass-through (filter, sort, distinct) - all modes resolve.
Attribute-id propagation (groupBy().count(), pivot, intersect, temp-view roundtrip) - all modes resolve; the source attribute id flows through.
Removal (withColumnRenamed, drop) - all modes raise (column gone by id and by name).
Shadowing (withColumn chain, select + alias, agg + alias) - Classic and Connect strict raise; Connect lenient resolves the shadowed name via name-based fallback (overridden in the lenient suite).
Union - Classic resolves the left-side reference (Union keeps the left child's attribute ids); Connect raises CANNOT_RESOLVE_DATAFRAME_COLUMN in both modes, because Union is treated as a leaf during plan-id resolution (overridden in the strict suite, inherited by lenient).
Self-join - the alias form is ambiguous and raises in all modes; the documented withColumnRenamed form resolves in all modes (the renamed side is filtered out during disambiguation).
Cross-DataFrame illegal reference (df1.select(df2.col)) - raises in all modes; the throw is not gated by the strict/lenient switch.
df["*"] star expansion and sort-missing-attribute recovery - resolve in all modes.

Mixed-surface layered programs (3): each chains 4-5 transformations - semi-joins (DataFrame-API EXISTS/IN), window functions, cube aggregation, NTILE, UDFs, struct-field access - and then references the outermost layer's columns in both filter and select. These exercise plan-id propagation across interacting analyzer rules, which single-operator tests miss. All modes resolve.

Why are the changes needed?

#55531 added the strictDataFrameColumnResolution config and a single shadowing test. This PR enumerates the Connect-vs-Classic resolution behavior across shadowing variants, attribute-id-propagating operators, set operations, self-joins, and multi-operator layered pipelines. A future tightening of Connect's column resolution will then surface as a clear test failure rather than a silent regression.

Does this PR introduce any user-facing change?

No. Test-only change.

How was this patch tested?

New tests, run locally across all three suites:

python/run-tests --testnames "pyspark.sql.tests.test_column ColumnTests"
python/run-tests --testnames "pyspark.sql.tests.connect.test_parity_column ColumnParityTests"
python/run-tests --testnames "pyspark.sql.tests.connect.test_parity_column ColumnParityTestsWithNonStrictDFColResolution"

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Anthropic), claude-opus-4-7

zhengruifeng · 2026-05-22T01:37:45Z

@vladimirg-db @cloud-fan would you mind taking a look? thanks

cloud-fan · 2026-05-22T09:41:57Z

Shall we do some investments on the test framework first? It's not the first time we want test parity between classic and connect, for both scala and python APIs, but duplicating the test case is not a good long term solution. How can we define the base test cases in the base suite, and extend the suite in both classic and connect with their dedicated test cases?

zhengruifeng · 2026-05-22T12:51:26Z

@cloud-fan this PR is not just for parity (thought some tests behave the same on both connect and classic).

E.g.

    def test_resolve_after_select_alias_shadow(self):
        # Same shadowing shape as withColumn but expressed through a select
        # with alias.
        # Classic: fails.
        sdf = self.spark.sql("SELECT 1 AS c")
        with self.assertRaises(AnalysisException):
            sdf.select(sdf["c"].cast("string").alias("c")).select(sdf["c"]).collect()

        # Connect strict: fails.
        with self.connect_conf({"spark.sql.analyzer.strictDataFrameColumnResolution": True}):
            cdf = self.connect.sql("SELECT 1 AS c")
            with self.assertRaisesRegex(AnalysisException, "CANNOT_RESOLVE_DATAFRAME_COLUMN"):
                cdf.select(cdf["c"].cast("string").alias("c")).select(cdf["c"]).collect()

        # Connect lenient: succeeds via name-based fallback.
        with self.connect_conf({"spark.sql.analyzer.strictDataFrameColumnResolution": False}):
            cdf = self.connect.sql("SELECT 1 AS c")
            cdf.select(cdf["c"].cast("string").alias("c")).select(cdf["c"]).collect()

…esolution Add parity tests in test_connect_column.py and layered programs in test_parity_dataframe.py to lock in known Connect/Classic behavior differences in DataFrame column resolution. Tests cover shadowing, pass-through, aggregation, pivot, set ops, self-join, subquery-as-table patterns under both strict and non-strict modes of spark.sql.analyzer.strictDataFrameColumnResolution, plus three mixed-surface layered programs that combine filters, joins, aggregations, set ops, window functions, UDFs and temp views. Generated-by: Claude Code (Anthropic), claude-opus-4-7

…sic baselines Move the focused parity tests and the layered programs into a dedicated SparkConnectColumnResolutionTests class in test_connect_column.py. Each test now starts with a Classic baseline against self.spark to document how Spark Classic handles the same pattern, followed by Connect strict and Connect lenient blocks. test_parity_dataframe.py is reverted to its prior state. Generated-by: Claude Code (Anthropic), claude-opus-4-7

Replace the three shallow layered programs with Reyden-influenced fixtures that mirror the patterns referenced in SC-229895 (reyden/query-tests/golden-files/layered-query-tests): - 4-level subquery chain with windows, HAVING and correlated EXISTS - CTE chain with GROUPING SETS, NTILE, struct field access and correlated IN - Self-join via SQL with windowed running totals, correlated EXISTS and UDF wrapping Each program builds the deeply layered base via spark.sql(), then layers DataFrame-API shadowing on top with a tagged df["c"] reference at the outermost select. Classic and Connect strict raise; Connect lenient succeeds via name-based fallback. Generated-by: Claude Code (Anthropic), claude-opus-4-7

Generated-by: Claude Code (Anthropic), claude-opus-4-7

The Connect-specific AnalysisException subclasses the base one, so a single import covers both. The assertRaisesRegex regex still pins the Connect-specific CANNOT_RESOLVE_DATAFRAME_COLUMN error class. Generated-by: Claude Code (Anthropic), claude-opus-4-7

…y assertions - Rewrite the three mixed-surface layered tests to use only the DataFrame API (chained transformations, semi-joins, Window functions, cube, UDFs and struct field access). spark.sql() is no longer used in the layered tests. - After running the suite locally, correct several assertions that assumed Classic-vs-Connect divergence where none exists: groupBy, intersect, pivot, and temp-view-roundtrip all preserve attribute-id propagation through Connect's plan-id resolver, so the tagged reference resolves in both strict and lenient modes. - For union the divergence does hold: Classic succeeds but Connect raises CANNOT_RESOLVE_DATAFRAME_COLUMN in both strict and lenient modes (the name-based fallback is not triggered for set-op outputs). Generated-by: Claude Code (Anthropic), claude-opus-4-7

Rename df / df1 / df2 variables that refer to Connect DataFrames in SparkConnectColumnResolutionTests to cdf / cdf1 / cdf2, matching the existing sdf convention for Classic. Drop the per-test pyspark.sql.types imports from the layered tests; the needed types are already imported at the top of the file. Generated-by: Claude Code (Anthropic), claude-opus-4-7

Based on the resolution logic in sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ColumnResolutionHelper.scala add four tests covering behaviors not previously exercised: - test_resolve_cross_dataframe_illegal_reference: the documented df1.select(df2.a) case where df2's plan id is not in df1's plan tree; fails with CANNOT_RESOLVE_DATAFRAME_COLUMN before any name-based fallback can run. - test_resolve_df_star: plan-id-tagged star expansion via UnresolvedDataFrameStar; succeeds in both Classic and Connect modes. - test_resolve_self_join_withcolumnrenamed: the documented self-join disambiguation example where the right-side candidate is filtered out by the rename projection above it. - test_resolve_sort_missing_attr_recovery: the documented df.select(df.v).sort(df.id) case where Sort's reference is recovered by resolveExprsAndAddMissingAttrs adding id back to the upstream projection, in both strict and lenient modes. Tighten comments on two existing tests: - test_resolve_after_union: cite that Union is treated as a leaf node when walking the plan tree for plan-id resolution. - test_resolve_self_join_alias: cite AMBIGUOUS_COLUMN_REFERENCE as the specific failure mode. Generated-by: Claude Code (Anthropic), claude-opus-4-7

Reframe the 3 layered DataFrame tests around ``layered[col]`` usage in filter and select at the outermost surface (no synthetic shadow on top), and move them into the shared ``ColumnTestsMixin`` so they run under Classic (``ColumnTests``), Connect strict (``ColumnParityTests``), and Connect lenient (``ColumnParityTestsWithNonStrictDFColResolution``) via the existing parity framework instead of building separate Classic and Connect arms in one test body. Generated-by: Claude Code (Anthropic), claude-opus-4-7

Reference DataFrame columns via the ``df.col_name`` getattr form instead of ``sf.col(...)`` / ``df["col"]`` in the three layered tests, binding the post-agg intermediates to variables so a DataFrame is in scope for the HAVING-style filters. Struct subfields keep bracket access because ``detail.name`` would resolve to ``Column.name``. Generated-by: Claude Code (Anthropic), claude-opus-4-8

Move the tagged DataFrame column-resolution tests out of the Connect-only SparkConnectColumnResolutionTests and into the shared ColumnTestsMixin, so each runs under Classic (ColumnTests), Connect strict (ColumnParityTests), and Connect lenient (ColumnParityTestsWithNonStrictDFColResolution). The mixin body asserts the behavior shared by Classic and Connect strict. The diverging cases are overridden in the parity suites: - union: ColumnParityTests asserts Connect raises CANNOT_RESOLVE_DATAFRAME_COLUMN in both modes (Classic resolves via attribute-id propagation in the mixin); - the shadowing trio: the lenient suite asserts name-based fallback succeeds. Tagged references use the df.col getattr form; df["*"] keeps bracket access and untagged transform expressions keep sf.col. Drops the now-unused SparkConnectColumnResolutionTests and its AnalysisException import. Generated-by: Claude Code (Anthropic), claude-opus-4-8

Move ``from pyspark.sql.window import Window`` out of the three layered test methods to the module-level imports. Add a comment on each shadowing resolution test noting that Connect lenient mode diverges (name-based fallback succeeds, overridden in the lenient parity suite). Generated-by: Claude Code (Anthropic), claude-opus-4-8

df.c on a DataFrame with only column `x` raises PySparkAttributeError before reaching the analyzer. Use df["c"] so the tagged UnresolvedAttribute is built eagerly and the AnalysisException is thrown at analysis time, which is what this test pins. Same fix applied in the lenient parity override. Generated-by: Claude Code (Anthropic), claude-opus-4-8

SELECT 1 AS x gave a df with no column c, so df.c raised PySparkAttributeError before Spark analysis. Change the source to SELECT 1 AS c and shadow via agg(sum("c").alias("c")) so df.c is valid and the tagged reference is aggregated away - same semantics, correct form. Generated-by: Claude Code (Anthropic), claude-opus-4-8

Both Union and Intersect preserve the left child's exprId (Union/Intersect.mergeChildOutputs), so neither "emits new attribute ids". The actual Connect divergence is that Union is treated as a leaf during plan-id resolution (ColumnResolutionHelper) while Intersect is not. Co-authored-by: Isaac

…L) group CUBE produces subtotal groups; ROLLUP is a distinct grouping operator, so 'subtotal' describes the (category, status, NULL) group more precisely. Co-authored-by: Isaac

cloud-fan · 2026-05-31T12:41:01Z

thanks, merging to master/4.x/4.2 (test only)

…ame column resolution ### What changes were proposed in this pull request? This PR widens test coverage of how a tagged DataFrame column reference (`df.col` / `df["col"]`, which carries the source DataFrame's plan id) resolves after a range of operators, pinning the behavior across Spark Classic and both modes of `spark.sql.analyzer.strictDataFrameColumnResolution` on Spark Connect. The new tests are added to the shared `ColumnTestsMixin` in `python/pyspark/sql/tests/test_column.py`, so each one runs in three environments: - `ColumnTests` - Classic (default `strictDataFrameColumnResolution=true`), - `ColumnParityTests` - Connect strict, - `ColumnParityTestsWithNonStrictDFColResolution` - Connect lenient (extends the strict suite). The base mixin asserts the behavior shared by all three. The few cases where Connect diverges are overridden in the Connect parity suites (`python/pyspark/sql/tests/connect/test_parity_column.py`) rather than duplicated, so the shared cases stay mode-agnostic. This follows the base-suite + per-mode-override structure suggested in review. **Per-operator resolution tests** (added to `ColumnTestsMixin`): - Pass-through (`filter`, `sort`, `distinct`) - all modes resolve. - Attribute-id propagation (`groupBy().count()`, `pivot`, `intersect`, temp-view roundtrip) - all modes resolve; the source attribute id flows through. - Removal (`withColumnRenamed`, `drop`) - all modes raise (column gone by id and by name). - Shadowing (`withColumn` chain, `select` + alias, `agg` + alias) - Classic and Connect strict raise; **Connect lenient** resolves the shadowed name via name-based fallback (overridden in the lenient suite). - Union - Classic resolves the left-side reference (Union keeps the left child's attribute ids); **Connect** raises `CANNOT_RESOLVE_DATAFRAME_COLUMN` in both modes, because Union is treated as a leaf during plan-id resolution (overridden in the strict suite, inherited by lenient). - Self-join - the alias form is ambiguous and raises in all modes; the documented `withColumnRenamed` form resolves in all modes (the renamed side is filtered out during disambiguation). - Cross-DataFrame illegal reference (`df1.select(df2.col)`) - raises in all modes; the throw is not gated by the strict/lenient switch. - `df["*"]` star expansion and sort-missing-attribute recovery - resolve in all modes. **Mixed-surface layered programs** (3): each chains 4-5 transformations - semi-joins (DataFrame-API EXISTS/IN), window functions, cube aggregation, NTILE, UDFs, struct-field access - and then references the outermost layer's columns in both `filter` and `select`. These exercise plan-id propagation across interacting analyzer rules, which single-operator tests miss. All modes resolve. ### Why are the changes needed? #55531 added the `strictDataFrameColumnResolution` config and a single shadowing test. This PR enumerates the Connect-vs-Classic resolution behavior across shadowing variants, attribute-id-propagating operators, set operations, self-joins, and multi-operator layered pipelines. A future tightening of Connect's column resolution will then surface as a clear test failure rather than a silent regression. ### Does this PR introduce _any_ user-facing change? No. Test-only change. ### How was this patch tested? New tests, run locally across all three suites: ``` python/run-tests --testnames "pyspark.sql.tests.test_column ColumnTests" python/run-tests --testnames "pyspark.sql.tests.connect.test_parity_column ColumnParityTests" python/run-tests --testnames "pyspark.sql.tests.connect.test_parity_column ColumnParityTestsWithNonStrictDFColResolution" ``` ### Was this patch authored or co-authored using generative AI tooling? Generated-by: Claude Code (Anthropic), claude-opus-4-7 Closes #55947 from zhengruifeng/SC-229895-connect-col-tests. Lead-authored-by: Ruifeng Zheng <ruifengz@apache.org> Co-authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 37fcee4) Signed-off-by: Wenchen Fan <wenchen@databricks.com>

zhengruifeng changed the title ~~[TESTS][CONNECT] Expand Connect-specific tests for DataFrame column resolution~~ [SPARK-56917][TESTS][CONNECT] Expand Connect-specific tests for DataFrame column resolution May 18, 2026

zhengruifeng marked this pull request as ready for review May 18, 2026 05:00

zhengruifeng marked this pull request as draft May 18, 2026 05:53

zhengruifeng marked this pull request as ready for review May 20, 2026 08:41

zhengruifeng added 11 commits May 29, 2026 11:00

Drop external-project reference from layered test section comment

3350fa5

Generated-by: Claude Code (Anthropic), claude-opus-4-7

zhengruifeng force-pushed the SC-229895-connect-col-tests branch from ad6758b to 29f715b Compare May 29, 2026 11:00

zhengruifeng and others added 5 commits May 29, 2026 12:32

Reword cube comment: 'rollup' -> 'subtotal' for the (cat, status, NUL…

2f08dc1

…L) group CUBE produces subtotal groups; ROLLUP is a distinct grouping operator, so 'subtotal' describes the (category, status, NULL) group more precisely. Co-authored-by: Isaac

cloud-fan closed this in 37fcee4 May 31, 2026

zhengruifeng deleted the SC-229895-connect-col-tests branch June 2, 2026 12:41

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[SPARK-56917][TESTS][CONNECT] Expand Connect-specific tests for DataFrame column resolution#55947

[SPARK-56917][TESTS][CONNECT] Expand Connect-specific tests for DataFrame column resolution#55947
zhengruifeng wants to merge 16 commits into
apache:masterfrom
zhengruifeng:SC-229895-connect-col-tests

zhengruifeng commented May 18, 2026 •

edited by cloud-fan

Loading

Uh oh!

zhengruifeng commented May 22, 2026

Uh oh!

cloud-fan commented May 22, 2026 •

edited

Loading

Uh oh!

zhengruifeng commented May 22, 2026 •

edited

Loading

Uh oh!

cloud-fan commented May 31, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

zhengruifeng commented May 18, 2026 • edited by cloud-fan Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

Was this patch authored or co-authored using generative AI tooling?

Uh oh!

zhengruifeng commented May 22, 2026

Uh oh!

cloud-fan commented May 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

zhengruifeng commented May 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

cloud-fan commented May 31, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

zhengruifeng commented May 18, 2026 •

edited by cloud-fan

Loading

cloud-fan commented May 22, 2026 •

edited

Loading

zhengruifeng commented May 22, 2026 •

edited

Loading